Add CSV parsing functions #2361

Merged
merged 11 commits into from
Jan 13, 2025

GuntherRademacher
Member

These changes add implementations of the XQuery 4.0 functions

  • fn:csv-to-arrays,
  • fn:parse-csv, and
  • fn:csv-to-xml.

The implementation uses the same CSV parser as csv:parse, adapting it by adding new options to integrate additional functionality.

There are six new CsvOptions, which can now also be used by csv:parse:

  • ROW_DELIMITER
  • QUOTE_CHARACTER
  • TRIM_WHITESPACE
  • TRIM_ROWS
  • SELECT_COLUMNS
  • STRICT_QUOTING

All but STRICT_QUOTING are defined in the XQuery 4.0 specification. STRICT_QUOTING = false is used to distinguish the preserved quoting behaviour of csv:parse from the quoting behaviour of the new functions.

TRIM_WHITESPACE is not yet implemented as specified in qt4cg/qtspecs#1677, as it also trims whitespace from quoted fields. In qt4cg/qtspecs#1675 I proposed additionally allowing whitespace outside of quotes. Once these issues have been resolved, I will adapt the implementation accordingly.

Empty-line handling had to be changed to conform to the XQuery 4.0 function specification. Empty lines used to be skipped by csv:parse; they are now unconditionally preserved for that function as well, so it behaves like the new functions in this respect. Tests have been added to CsvModuleTest, and the changed behaviour has been annotated like this:

    // was: "<csv/>"
    parse("\n", "", "<csv><record/></csv>");
    // was: "<csv/>"
    parse("\n\n", "", "<csv><record/><record/></csv>");
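
To illustrate the new empty-line rule in isolation, here is a minimal sketch. It is not the BaseX CSV parser: the class `EmptyRowSketch` and the method `splitRows` are hypothetical names, and quoting and field splitting are ignored, with `\n` assumed as the only row delimiter. It only shows that every terminated line yields a row, even when the line is empty.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only, not BaseX code: quoting and field splitting are ignored.
class EmptyRowSketch {
  /** Splits the input at '\n'; every terminated line yields a row, even if it is empty. */
  static List<String> splitRows(final String input) {
    final List<String> rows = new ArrayList<>();
    final StringBuilder row = new StringBuilder();
    for(int i = 0; i < input.length(); i++) {
      final char ch = input.charAt(i);
      if(ch == '\n') {
        // a row delimiter always closes a row (the old behaviour would skip empty ones)
        rows.add(row.toString());
        row.setLength(0);
      } else {
        row.append(ch);
      }
    }
    // unterminated trailing content forms a final row
    if(row.length() > 0) rows.add(row.toString());
    return rows;
  }

  public static void main(final String[] args) {
    System.out.println(splitRows("\n").size());     // prints 1
    System.out.println(splitRows("\n\n").size());   // prints 2
    System.out.println(splitRows("a\n\nb").size()); // prints 3
  }
}
```

The row counts for `"\n"` and `"\n\n"` correspond to the one and two `<record/>` elements expected in the tests above.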

With these changes, BaseX passes most of the QT4 tests for the new functions. The remaining failures are cases in which a different error code is raised than expected, e.g.:

parse-csv-907
fn:parse-csv('one,two', map{'row-delimiter':('|','||')})
Error : FOCV0002: The value of row-delimiter is not a single character: | ||.
Expect: XPTY0004

parse-csv-914
parse-csv("a,b,c,d,e,f|p,q,r,s,t,u", map{'row-delimiter':'|', 'select-columns':(4,3,1)})?get(-1, 2)
Error : FORG0001: Cannot convert xs:integer to xs:positiveInteger: -1.
Expect: XPTY0004

Member

@ChristianGruen ChristianGruen left a comment

Great.

@ChristianGruen ChristianGruen merged commit a7c6c12 into BaseXdb:main Jan 13, 2025
@ChristianGruen ChristianGruen deleted the csv branch January 13, 2025 14:49